accurate stochastic gradient estimation
Fast and Accurate Stochastic Gradient Estimation
Stochastic Gradient Descent or SGD is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the false impression of faster convergence in iterations, in reality, leads to slower convergence in time. In this paper, we break this barrier by providing the first demonstration of a scheme, Locality sensitive hashing (LSH) sampled Stochastic Gradient Descent (LGD), which leads to superior gradient estimation while keeping the sampling cost per iteration similar to that of the uniform sampling. Such an algorithm is possible due to the sampling view of LSH, which came to light recently. As a consequence of superior and fast estimation, we reduce the running time of all existing gradient descent algorithms, that relies on gradient estimates including Adam, Ada-grad, etc. We demonstrate the effectiveness of our proposal with experiments on linear models as well as the non-linear BERT, which is a recent popular deep learning based language representation model.
Reviews: Fast and Accurate Stochastic Gradient Estimation
Summary: This paper develops a new method for adaptively sampling training examples during stochastic optimization. It is known that the optimal distribution that minimizes the nuclear norm of the covariance of the gradient estimate is one where the the probability of sampling an example is proportional to the magnitude of the gradient of the loss on that example. Sampling according to this distribution is of course impractical, because computing this distribution is as expensive as computing the full gradient and requires O(N) time per iteration. To get around this, prior work either maintains a fixed distribution across all iterations or makes strong assumptions on the distribution of gradients of different training examples (e.g.: the gradients of training examples of the same class are similar). This paper proposes a method that can adaptively sample from different distributions every iteration and requires little assumptions on the distribution of gradients, and yet requires the same per-iteration cost as SGD.
Reviews: Fast and Accurate Stochastic Gradient Estimation
This paper received extensive discussion by the reviewers, the meta-reviewer, the SPC, etc. Here is a meta-review summary. The paper considers the problem of adaptively sampling training examples in stochastic optimization, and it shows that it is possible to do so without a per-iteration cost of O(N). This is of interest by itself, since one typically thinks that such sampling requires maintaining a distribution over training examples, which requires O(N) in every iteration, i.e., which is as expensive as full-batch gradient descent. A second aspect of this paper is that the mechanism by which the authors accomplish this is to use LSH, which is a sketching method usually used for nearest neighbor search.
Fast and Accurate Stochastic Gradient Estimation
Stochastic Gradient Descent or SGD is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the false impression of faster convergence in iterations, in reality, leads to slower convergence in time.
Fast and Accurate Stochastic Gradient Estimation
Chen, Beidi, Xu, Yingchen, Shrivastava, Anshumali
Stochastic Gradient Descent or SGD is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. There have been several other works that suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribution for gradient estimation is more than calculating the full gradient itself, which we call the chicken-and-the-egg loop. As a result, the false impression of faster convergence in iterations, in reality, leads to slower convergence in time.